We hope to explore the relative influence of physical traits, environmental conditions, and species identity on the growth rate of trees. A gradient boosted model is a good candidate for this work because it can capture non-linear relationships and interactions among predictors without our having to specify them in advance.
We first converted the environmental variables to principal
components, as they were highly correlated. We visualized the PCA and
used the eigenvectors to identify which environmental condition best
explained each PC. The retained axes were interpreted as
Soil.Fertility, Light, Temperature, pH, Soil.Humidity.Depth, and Slope.
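As a hedged sketch of that step (the variable names and values below are invented stand-ins, not the real survey data), the PCA and its eigenvector loadings can be obtained with base R's `prcomp`:

```r
# Illustrative PCA on made-up, correlated environmental measurements.
# The rotation matrix holds the eigenvectors used to interpret each PC.
set.seed(1)
env <- data.frame(
  temp     = rnorm(30, mean = 15, sd = 3),
  moisture = rnorm(30, mean = 40, sd = 5)
)
env$elevation <- 1000 - 50 * env$temp + rnorm(30, sd = 10)  # correlated with temp

pca <- prcomp(env, center = TRUE, scale. = TRUE)
summary(pca)   # proportion of variance explained by each PC
pca$rotation   # eigenvector loadings: which variable drives which PC
```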
We also want to ensure that the plant traits are not correlated. Past work suggests that they are not well represented by a PCA, so we will not use this feature-reduction method for them.
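One quick way to screen for that (using placeholder trait values, not the real data) is to flag trait pairs whose absolute Pearson correlation exceeds a threshold:

```r
# Screen a made-up trait table for strongly correlated pairs.
traits <- data.frame(
  LMA       = c(80, 95, 110, 70, 120, 100),
  LNC       = c(2.1, 1.8, 1.5, 2.4, 1.3, 1.7),
  Leaf.Area = c(12, 30, 8, 22, 15, 18)
)
cor_mat <- abs(cor(traits))
diag(cor_mat) <- 0                      # ignore self-correlations
which(cor_mat > 0.7, arr.ind = TRUE)    # pairs worth a closer look
```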
A gradient boosted machine (GBM) is a machine learning model that fits the data using an ensemble of decision trees.
A decision tree starts with all of the observations and then, from the variables provided, finds the variable and split value that produce the “purest” groupings of the data. In this case, it would try to place rows with higher growth rates in one node and those with lower growth rates in another.
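As a minimal sketch of that search (a toy example, not the fitted model), finding the best split point for one predictor against a continuous response amounts to minimizing the summed within-node sum of squared errors:

```r
# Toy illustration: for one predictor, find the split that minimizes the
# summed within-node sum of squared errors (the "purest" grouping for a
# continuous response). The data here are invented for the example.
best_split <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  candidates <- sort(unique(x))[-1]     # possible cut points
  scores <- sapply(candidates, function(cut) {
    sse(y[x < cut]) + sse(y[x >= cut])  # impurity after the split
  })
  candidates[which.min(scores)]
}

# Trees with low x grow slowly; trees with high x grow quickly
x <- c(1, 2, 3, 10, 11, 12)
y <- c(0.1, 0.2, 0.1, 0.9, 1.0, 0.8)
best_split(x, y)   # the cut falls between the two clusters
```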
GBMs are an ensemble of decision trees, but the trees are fit
sequentially. We call GBMs an ensemble of weak learners because each
subsequent tree attempts to correct the errors of the previous
trees. Thus, while one tree by itself cannot describe the
relationships, the full set of trees can. Below is a figure
by Bradley Boehmke that illustrates how each subsequent tree
improves the fit to the data.
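The sequential correction can be sketched in a few lines of base R: each round fits a single-split “stump” to the residuals of the current ensemble and adds it in with a small learning rate. This is a simplified illustration, not the model fit below:

```r
# Minimal sketch of gradient boosting for regression: each round fits a
# one-split "stump" to the residuals of the current prediction, and the
# ensemble adds it in with a small learning rate (shrinkage).
boost <- function(x, y, n_trees = 50, learn_rate = 0.1) {
  pred <- rep(mean(y), length(y))   # start from the mean
  for (i in seq_len(n_trees)) {
    resid <- y - pred               # errors of the current ensemble
    cuts  <- sort(unique(x))[-1]
    sse   <- sapply(cuts, function(cut) {
      left <- resid[x < cut]; right <- resid[x >= cut]
      sum((left - mean(left))^2) + sum((right - mean(right))^2)
    })
    cut  <- cuts[which.min(sse)]    # best single split on the residuals
    step <- ifelse(x < cut, mean(resid[x < cut]), mean(resid[x >= cut]))
    pred <- pred + learn_rate * step  # shrunken update
  }
  pred
}

x <- 1:20
y <- log(x)
pred <- boost(x, y)
mean((y - pred)^2) < mean((y - mean(y))^2)   # the ensemble beats the mean
```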
We compared the fit of three gradient boosted models to determine how environmental gradients and physical traits influence relative growth rate (RGR):
First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_57"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 11
##
## $min_rows
## [1] 1
##
## $nbins
## [1] 16
##
## $nbins_cats
## [1] 256
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3395.824
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.49
##
## $col_sample_rate
## [1] 0.42
##
## $col_sample_rate_per_tree
## [1] 0.34
##
## $min_split_improvement
## [1] 1e-08
##
## $histogram_type
## [1] "UniformAdaptive"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope"
## [6] "Estem" "Branching.Distance" "Stem.Wood.Density" "Leaf.Area" "LMA"
## [11] "LCC" "LNC" "LPC" "d15N" "t.b2"
## [16] "Ks" "Ktwig" "Huber.Value" "X.Lum" "VD"
## [21] "X.Sapwood" "d13C" "Tree.Age" "julian.date.2011"
##
## $y
## [1] "BAI_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bai_residuals <-
  gbm(BAI_GR ~ .,
      data =
        rgr_msh_na %>%
        filter(Group == "Train") %>%
        filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BAI_GR", "julian.date.2011"))),
      n.trees = 1000,
      interaction.depth = 11, # max_depth
      shrinkage = 0.05,       # learn_rate
      n.minobsinnode = 10,    # min_rows
      bag.fraction = 0.49,    # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

First, we look at the importance of variables in the model.
Next, holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform when we use the true individual trait value?
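Model performance on the hold-out data can be summarized with RMSE and R²; the sketch below uses invented observed/predicted vectors in place of the real test set:

```r
# Hedged sketch of the hold-out evaluation: RMSE and R-squared computed
# from observed vs. predicted growth rates. `obs` and `pred` are
# placeholder vectors, not model output.
obs  <- c(0.8, 1.1, 0.5, 1.4, 0.9)
pred <- c(0.7, 1.0, 0.6, 1.3, 1.0)
rmse <- sqrt(mean((obs - pred)^2))
r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
c(RMSE = rmse, R2 = r2)
```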
Let’s explore the interactions in these data.
##
## Kruskal-Wallis rank sum test
##
## data: Value by Class
## Kruskal-Wallis chi-squared = 21.524, df = 3, p-value = 8.194e-05
## Comparison Z P.unadj
## 1 Environmental Conditions:Environmental Conditions - Environmental Conditions:Plant Traits 0.5730733 5.665950e-01
## 2 Environmental Conditions:Environmental Conditions - Plant Traits:Environmental Conditions -2.9581361 3.095055e-03
## 3 Environmental Conditions:Plant Traits - Plant Traits:Environmental Conditions -1.7512647 7.990033e-02
## 4 Environmental Conditions:Environmental Conditions - Plant Traits:Plant Traits -4.2940685 1.754283e-05
## 5 Environmental Conditions:Plant Traits - Plant Traits:Plant Traits -2.2793383 2.264696e-02
## 6 Plant Traits:Environmental Conditions - Plant Traits:Plant Traits -1.4936179 1.352755e-01
## P.adj
## 1 0.566595039
## 2 0.015475273
## 3 0.239700984
## 4 0.000105257
## 5 0.090587842
## 6 0.270551058
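For reference, the Kruskal-Wallis test reported above is base R's `kruskal.test`; here it is applied to toy interaction-strength values grouped into made-up classes:

```r
# Toy reproduction of the test's mechanics, not the real interaction values.
vals  <- c(0.10, 0.20, 0.15, 0.40, 0.50, 0.45, 0.80, 0.90, 0.85)
class <- factor(rep(c("EC:EC", "EC:PT", "PT:PT"), each = 3))
kw <- kruskal.test(vals ~ class)
kw$p.value   # small p-value: at least one class differs in rank
```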
Now, we plot interactions with values > 0.10.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
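The grouping step can be sketched as summing per-variable relative influence within each predictor group; the importance numbers below are invented placeholders, not the fitted model's output:

```r
# Aggregate (made-up) relative influence by predictor group.
imp <- data.frame(
  variable = c("Light", "pH", "LMA", "LNC", "Tree.Age"),
  rel.inf  = c(20, 10, 30, 25, 15),
  group    = c("Environment", "Environment", "Trait", "Trait", "Age")
)
tapply(imp$rel.inf, imp$group, sum)   # total influence per group
```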
First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_64"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 9
##
## $min_rows
## [1] 4
##
## $nbins
## [1] 1024
##
## $nbins_cats
## [1] 32
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3488.306
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.74
##
## $col_sample_rate
## [1] 0.94
##
## $col_sample_rate_per_tree
## [1] 0.99
##
## $min_split_improvement
## [1] 0
##
## $histogram_type
## [1] "RoundRobin"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope" "Species"
## [7] "Tree.Age" "julian.date.2011"
##
## $y
## [1] "BAI_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_bai_residuals_species <-
  gbm(BAI_GR ~ .,
      data =
        rgr_msh_na %>%
        filter(Group == "Train") %>%
        filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BAI_GR", "julian.date.2011"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,
      interaction.depth = 9, # max_depth
      shrinkage = 0.05,      # learn_rate
      n.minobsinnode = 7,    # min_rows
      bag.fraction = 0.74,   # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

First, we look at the importance of variables in the model.
Next, holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform?
Let’s explore the interactions in these data.
##
## Kruskal-Wallis rank sum test
##
## data: Value by Class
## Kruskal-Wallis chi-squared = 3.6714, df = 3, p-value = 0.2992
Now, we plot interactions with values > 0.10.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
First, we look at the best parameters from tuning.
## $model_id
## [1] "final_grid_model_61"
##
## $training_frame
## [1] "train.hex"
##
## $validation_frame
## [1] "valid.hex"
##
## $score_tree_interval
## [1] 10
##
## $ntrees
## [1] 10000
##
## $max_depth
## [1] 9
##
## $min_rows
## [1] 4
##
## $nbins
## [1] 16
##
## $nbins_cats
## [1] 64
##
## $stopping_rounds
## [1] 5
##
## $stopping_metric
## [1] "MSE"
##
## $stopping_tolerance
## [1] 1e-04
##
## $max_runtime_secs
## [1] 3510.497
##
## $seed
## [1] 1234
##
## $learn_rate
## [1] 0.05
##
## $learn_rate_annealing
## [1] 0.99
##
## $distribution
## [1] "gaussian"
##
## $sample_rate
## [1] 0.49
##
## $col_sample_rate
## [1] 0.86
##
## $col_sample_rate_per_tree
## [1] 0.44
##
## $min_split_improvement
## [1] 1e-06
##
## $histogram_type
## [1] "QuantilesGlobal"
##
## $categorical_encoding
## [1] "Enum"
##
## $calibration_method
## [1] "PlattScaling"
##
## $x
## [1] "Soil.Fertility" "Light" "Temperature" "pH" "Slope"
## [6] "Estem" "Branching.Distance" "Stem.Wood.Density" "Leaf.Area" "LMA"
## [11] "LCC" "LNC" "LPC" "d15N" "t.b2"
## [16] "Ks" "Ktwig" "Huber.Value" "X.Lum" "VD"
## [21] "X.Sapwood" "d13C" "Species" "Tree.Age" "julian.date.2011"
##
## $y
## [1] "BAI_GR"
Now, we can build the model.
set.seed(123)
gbm_regressor_baiSpeciesAgeEP <-
  gbm(BAI_GR ~ .,
      data =
        rgr_msh_na %>%
        filter(Group == "Train") %>%
        filter(!is.na(BAI_GR)) %>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Species",
                        "Tree.Age", "BAI_GR", "julian.date.2011"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,
      interaction.depth = 9, # max_depth
      shrinkage = 0.05,      # learn_rate
      n.minobsinnode = 21,   # min_rows
      bag.fraction = 0.49,   # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

First, we look at the importance of variables in the model.
Next, holding everything else constant, we assess the relationship between growth rate and each predictor.
How does the model perform?
Let’s explore the interactions in these data.
##
## Kruskal-Wallis rank sum test
##
## data: Value by Class
## Kruskal-Wallis chi-squared = 32.442, df = 3, p-value = 4.224e-07
Now, we plot interactions with values > 0.10.
Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.